ROMBAC: The Romanian Balanced Annotated Corpus

Authors

  • Radu Ion
  • Elena Irimia
  • Dan Stefanescu
  • Dan Tufis
Abstract

This article describes the collection, processing and validation of a large balanced corpus for Romanian, and briefly reviews the annotation types and structure of the corpus. The corpus was constructed at the Research Institute for Artificial Intelligence of the Romanian Academy in the context of an international project (METANET4U). The processing covers tokenization, POS-tagging, lemmatization and chunking. The corpus is in XML format generated by our in-house annotation tools; the corpus encoding schema is XCES compliant and the metadata specification conforms to the METANET recommendations. To the best of our knowledge, this is the first large and richly annotated corpus for Romanian. ROMBAC is intended to be the foundation of a linguistic environment containing a reference corpus for contemporary Romanian and a comprehensive collection of interoperable processing tools.


Similar resources

Introducing a Romanian Frequency List and the Romanian Vocabulary Levels Test

Vocabulary is considered essential to language learning, thus English word lists and tests based on frequency information have become the centre of attention for researchers, teachers and learners alike. As a result, it is argued hereby that frequency based word lists and tests should be adapted and regarded as key elements for teaching and learning Romanian as an additional language as well. S...


Romanian TimeBank: An Annotated Parallel Corpus for Temporal Information

The paper describes the main steps for the construction, annotation and validation of the Romanian version of the TimeBank corpus. Starting from the English TimeBank corpus – the reference annotated corpus in the temporal domain, we have translated all the 183 English news texts into Romanian and mapped the English annotations onto Romanian, with a success rate of 96.53%. Based on ISO-Time the ...


Resolving Romanian Zero Pronouns: A Machine Learning Approach

This paper presents a new study on the distribution, identification, and resolution of zero pronouns in Romanian. A Romanian corpus, including legal, encyclopaedic, literary, and news texts has been created and manually annotated for zero pronouns. Using a morphological parser for Romanian and machine learning methods, experiments were performed on the created corpus for the identification and ...


Zero Pronominal Anaphora Resolution for the Romanian Language

This paper presents a new study on the distribution, identification, and resolution of zero pronouns in Romanian. A Romanian corpus, including legal, encyclopaedic, literary, and news texts has been created and manually annotated for zero pronouns. Using a morphological parser for Romanian and machine learning methods, experiments were performed on the created corpus for the identification and ...


A Romanian Corpus for Speech Perception and Automatic Speech Recognition

A speech corpus is available in Romanian to use as the common material in speech perception and automatic speech recognition. It consists of high-quality audio of 400 sentences spoken by each of 12 speakers. Utterances are simple, syntactically identical phrases such as “muta bronz cu p 2 agale.” Preliminary intelligibility tests using the audio signals suggest that the collected speech is easi...




Publication date: 2012